Colorectal cancer (CRC) has the 3rd highest mortality rate among cancers
More effective methods to detect CRC are needed
Correlation between exosomes and tumorigenesis
miRNA and mRNA can serve as biomarkers: these are what we want to find!
The data were loaded with the GEOquery library, so no files had to be downloaded manually
Both the primary data and the metadata were loaded
Data was already standardized
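A minimal sketch of how such a load could look with GEOquery. The helper name `load_geo_series` is hypothetical, the accession is left to the caller (no specific GSE ID is assumed here), and taking the first platform of the series is an assumption.

```r
# Sketch: load a GEO series without manually downloading files.
# `load_geo_series` is a hypothetical helper, not part of the project code.
load_geo_series <- function(accession) {
  # getGEO() returns a list of ExpressionSet objects, one per platform
  series <- GEOquery::getGEO(accession, GSEMatrix = TRUE)
  eset <- series[[1]]  # assumption: the series has a single platform

  list(
    expression = Biobase::exprs(eset),  # primary data: features x samples
    metadata   = Biobase::pData(eset)   # sample-level metadata
  )
}
```

Calling the helper needs the GEOquery and Biobase packages and network access.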
Fetch analyte.tsv & clinical.tsv from raw_/
The TCGAbiolinks library is used to retrieve data from the GDC Data Portal
Function: retrieve_and_prepare()
GDCquery: Query to specify the data to get
GDCdownload: Downloading the samples from the query
Example:
miRNA data - 2 separate dataframes
mRNA data - Large SummarizedExperiment
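The query and download steps could be sketched roughly as follows. This is not the project's `retrieve_and_prepare()`; the name `retrieve_and_prepare_sketch`, its arguments, and the final `GDCprepare()` call are assumptions made for illustration.

```r
# Hypothetical sketch of a retrieval helper built on TCGAbiolinks.
# Argument names and the GDCprepare() step are assumptions, not the project's code.
retrieve_and_prepare_sketch <- function(project, data_type, workflow_type = NULL) {
  query_args <- list(
    project       = project,
    data.category = "Transcriptome Profiling",
    data.type     = data_type
  )
  # Only pass workflow.type when the caller supplies one
  if (!is.null(workflow_type)) query_args$workflow.type <- workflow_type

  query <- do.call(TCGAbiolinks::GDCquery, query_args)  # specify the data to get
  TCGAbiolinks::GDCdownload(query)                       # download the matching samples
  TCGAbiolinks::GDCprepare(query)                        # read the files into an R object
}

# Possible usage (project ID is an assumption; needs network access):
# mirna <- retrieve_and_prepare_sketch("TCGA-COAD", "miRNA Expression Quantification")
# mrna  <- retrieve_and_prepare_sketch("TCGA-COAD", "Gene Expression Quantification",
#                                      workflow_type = "STAR - Counts")
```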
Calculation of the normalization factors for the data (_log) with calcNormFactors and imputation of NAs using means.
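Mean imputation can be illustrated in base R. The sketch below assumes that NAs are replaced by the mean of the same feature (row) across samples; the slides do not say whether the mean was taken per feature or per sample, and the toy matrix is made up.

```r
# Replace each NA with the mean of the observed values in the same row (feature).
# Row-wise imputation is an assumption; the helper name is hypothetical.
impute_row_means <- function(mat) {
  row_means <- rowMeans(mat, na.rm = TRUE)
  missing <- which(is.na(mat), arr.ind = TRUE)  # (row, col) index of every NA
  mat[missing] <- row_means[missing[, "row"]]
  mat
}

# Toy data, not from the project
toy <- matrix(c(1, NA, 3,
                4, 6, NA),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("mir-a", "mir-b"), c("s1", "s2", "s3")))
imputed <- impute_row_means(toy)
```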
Running a “universal” edgeR differential analysis function with a quasi-likelihood model.
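A quasi-likelihood analysis in edgeR typically follows the steps sketched below. This is a generic outline, not the project's "universal" function: the name `run_edger_qlf_sketch`, the two-group design, and the choice of `coef = 2` are assumptions.

```r
# Generic edgeR quasi-likelihood workflow for a two-group comparison.
# `counts`: features x samples matrix of raw counts; `group`: one label per sample.
run_edger_qlf_sketch <- function(counts, group) {
  group  <- factor(group)
  design <- stats::model.matrix(~ group)

  dge  <- edgeR::DGEList(counts = counts, group = group)
  dge  <- edgeR::calcNormFactors(dge)        # TMM normalization factors by default
  dge  <- edgeR::estimateDisp(dge, design)   # dispersion estimates
  fit  <- edgeR::glmQLFit(dge, design)       # quasi-likelihood GLM fit
  test <- edgeR::glmQLFTest(fit, coef = 2)   # test the group effect vs. the reference level

  # topTags() adds a Benjamini-Hochberg FDR column; n = Inf keeps every feature
  edgeR::topTags(test, n = Inf)$table
}
```

The returned table has the columns logFC, logCPM, F, PValue and FDR.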
Statistics table:
# A tibble: 1,881 × 6
miRNA_ID logFC logCPM F PValue FDR
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 hsa-mir-135b 4.83 11.5 245. 4.11e-19 7.73e-16
2 hsa-mir-19b-2 2.99 11.6 235. 2.41e-15 1.81e-12
3 hsa-mir-590 4.43 11.2 154. 2.88e-15 1.81e-12
4 hsa-mir-374a 2.14 12.0 279. 6.03e-15 2.84e-12
5 hsa-mir-450b 5.22 11.0 191. 4.85e-14 1.57e-11
6 hsa-mir-19a 5.75 11.4 159. 5.02e-14 1.57e-11
7 hsa-mir-889 3.95 11.2 200. 1.41e-13 3.78e-11
8 hsa-mir-19b-1 2.37 11.7 308. 1.22e-12 2.86e-10
9 hsa-mir-708 2.61 11.5 184. 2.13e-12 4.45e-10
10 hsa-mir-96 3.24 11.2 113. 3.38e-12 6.36e-10
# ℹ 1,871 more rows
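The FDR column in a table like this is a multiple-testing adjustment of PValue; edgeR's `topTags()` uses the Benjamini-Hochberg method by default. A small base R sketch with made-up p-values shows the adjustment; the 0.05 cutoff is an assumption, not a threshold stated in the slides.

```r
# Toy p-values, not taken from the results table
p_values <- c(0.001, 0.01, 0.02, 0.5)

# Benjamini-Hochberg adjustment: p * n / rank, made monotone from the largest p down
fdr <- p.adjust(p_values, method = "BH")

# Flag features below an assumed FDR cutoff of 0.05
significant <- fdr < 0.05
```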
Final augmented dataset:
# A tibble: 75,240 × 3
miRNA_ID TCGA_ID log_reads
<chr> <chr> <dbl>
1 hsa-let-7a-1 TCGA-F4-6854-01A 14.5
2 hsa-let-7a-1 TCGA-AA-A00O-01A 13.0
3 hsa-let-7a-1 TCGA-DM-A28F-01A 15.4
4 hsa-let-7a-1 TCGA-NH-A6GC-01A 15.2
5 hsa-let-7a-1 TCGA-AA-A010-01A 13.8
6 hsa-let-7a-1 TCGA-AA-A00D-01A 12.4
7 hsa-let-7a-1 TCGA-AA-A00U-01A 11.3
8 hsa-let-7a-1 TCGA-D5-6922-01A 15.7
9 hsa-let-7a-1 TCGA-G4-6323-01A 15.2
10 hsa-let-7a-1 TCGA-F4-6703-01A 14.5
# ℹ 75,230 more rows
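A long table of this shape (one row per miRNA and sample) can be derived from a count matrix. The sketch below uses base R on toy data; the `log2(x + 1)` transform is an assumption, since the slides do not state how log_reads was computed, and the sample IDs are placeholders.

```r
# Toy count matrix: miRNAs in rows, samples in columns (values and IDs are made up)
counts <- matrix(c(3, 0, 7, 15), nrow = 2,
                 dimnames = list(c("mir-a", "mir-b"),
                                 c("sample-1", "sample-2")))

# Reshape to long format: one row per (miRNA, sample) pair
long <- as.data.frame(as.table(counts), stringsAsFactors = FALSE)
names(long) <- c("miRNA_ID", "TCGA_ID", "reads")

# Assumed transform: log2 with a pseudocount of 1 so zero counts stay finite
long$log_reads <- log2(long$reads + 1)
long <- long[, c("miRNA_ID", "TCGA_ID", "log_reads")]
```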
| TCGA mRNA | TCGA miRNA | GSE miRNA |
|---|---|---|
The TCGA and GSE datasets have different stages, and the data we used has a different sample size in each stage.
Although we were able to follow the article’s instructions, our results differ significantly from the published ones. The differences may stem from additional preprocessing steps taken by the authors, or from the sparse information they provide. It would be wise to contact the authors to ask about preprocessing and data retrieval. Overall, our analysis was carried out accurately, and the results did not indicate any grave errors. In addition, the data we used has a different number of samples in each stage, and the stages differ between the TCGA and GSE datasets.
R for Bio Data Science